虚拟化CPU
目标
让用户觉得自己的单核CPU电脑能够同时运行多个程序
Mechanism
- 两个问题
- 当前哪一个进程获得CPU(scheduling policy CPU调度)
- 进程之间怎么切换
进程调度
在进程切换的时候,OS应该选择哪一个进程占用CPU才是合理且高效的呢?
Policies
Assumptions
- Each job runs for the same amount of time.
- All jobs arrive at the same time.
- Once started, each job runs to completion.
- All jobs only use the CPU (i.e., they perform no I/O)
- The run-time of each job is known.
Targets:
Scheduling policies
First In, First Out (FIFO)
Shortest Job First (SJF)
- 在假设1不成立的时候依然能够很好的工作
- 如果假设2不成立
Shortest Time-to-Completion First (STCF)
- 在假设1,2,3不成立的时候依然能够很好的工作
- 但是对于现代的分时操作系统来说,除了turnaround之外,响应时间或许是一个更加重要的指标
多级反馈队列
多级反馈队列是一种实际运用相对较多的调度算法,在Windows NT内核中都有用到,我们单独拿出来讲
虚拟化内存
- 为什么我们需要虚拟化内存?
Mostly ease of use: the OS will give each program the view that it has a large contiguous address space to put its code and data into; thus, as a programmer, you never have to worry about things like “where should I store this variable?” because the virtual address space of the program is large and has lots of room for that sort of thing. Life, for a programmer, becomes much more tricky if you have to worry about fitting all of your code data into a small, crowded memory—《Operating System: There easy pieces》
目标
- transparency
- efficiency
- protection
Mechanism
- Address Translation()
- Dynamic (Hardware-based) Relocation
- Dynamic Relocation 基于硬件,在程序运行时发生,程序在中断之后,可以把虚拟地址映射到另外一个与之前不同的物理地址空间,并且提供进程之间的保护
- static(software-based) relocation
- static relocation 基于软件,在程序运行之前完成,不支持在运行时虚拟地址的重新映射,不提供进程之间的保护


Policies
Naive solution
Segmentation
paging
为了解决分段带来的外部fragmentation问题,分页技术把内存分成大小相同的页

- The simplest form is called a linear page table, which is just an array. The OS indexes the array by the virtual page number (VPN), and looks up the page-table entry (PTE) at that index in order to find the desired physical frame number (PFN)
- address translation procedure
- problems:
- efficiency
- time: 访问两次内存
- space: 每个进程一个页表,占用太多的内存空间
speed-up page table(TLB: translation-lookaside buffer)
save memory space
- Bigger Pages
- problems:
- big pages lead to waste within each page(internal fragmentation)
- Hybrid Approach: Paging and Segments
- use the base not to point to the segment itself but rather to hold the physical address of the page table of that segment
- unallocated pages between the stack and the heap no longer take up space in a page table
- translation procedure
- problems:
- have a large but sparsely-used heap, for example, we can still end up with a lot of page table waste
- page tables now can be of arbitrary size
- Multi-level Page Tables
- page table of page table

- The page directory thus either can be used to tell you where a page of the page table is, or that the entire page of the page table contains no valid pages
- translation procedure
Inverted Page Tables
- keep a single page table that has an entry for each physical page of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page.
Swapping the Page Tables to Disk
- place such page tables in kernel virtual memory, thereby allowing the system to swap some of these page tables to disk when memory pressure gets a little tight
swapping
- swapping技术的存在同样是服务于primal goal
- 在PTE中有一个present bit字段,如果这个字段为0,表示当前页不在内存中,也就是说产生了一个page fault


Replacement algorithm
并发
线程
- a new abstraction for a single running process: that of a thread
- a multi-threaded program has more than one point of execution (i.e., multiple PCs, each of which is being fetched and executed from)
- context switch(TCBs)
- KEY CONCURRENCY TERMS
- goals
- 引入线程的目的是希望我们的程序fast and correct
- 怎么保证correct?
- 怎么实现mutual exclusion?
- 同步原语(synchronization primitives)
同步原语
同步原语是什么?
- 像锁、信号量、条件变量等都是同步原语
- 锁、条件变量和信号量之间的区别和联系是怎样的?
- 锁是用来实现互斥访问的基本同步原语
- Semaphores and condition variables build on top of the mutual exclusion provide by locks and are used for providing synchronized access to shared resources. They can be used for similar purposes.
- A condition variable is generally used to avoid busy waiting
- 信号量能够同时完成锁和条件变量的功能
- binary semaphore并且初始值是1的时候相当于锁
- binary semaphore并且初始值是0的时候相当于条件变量
- 虽然信号量能完成锁和条件变量能完成的事,但是二者之间的联系并不是那么紧密,建议从信号量的语义出发理解并使用信号量
锁在底层是怎么实现的?
- 在实现锁的时候我们需要考虑的因素有哪些?
- mutual exclusion – 基本要求
- fairness
- performance
- 我们可以借助硬件开关中断的方式实现锁
-
- 但是开关中断的方法有寄个缺点:
- this approach requires us to allow any calling thread to perform a privileged operation
- the approach does not work on multiprocessors
- turning off interrupts for extended periods of time can lead to interrupts becoming lost
- 还有一系列的方式可以用来实现锁(自旋锁、队列锁)
- linux系统使用的是一种叫做两段锁的锁
用锁实现thread-safe的数据结构
- counter
- 实现一个线程安全的计数器很简单
- 效率更加高效的sloppy counter

- 但是这种实现存在一个performance/accuracy tradeoff
- 对上图代码一个小小的改进就是在get方法中首先对每个线程调用update方法,更近global counter之后再返回global counter,但是这种方法依然不能保证百分之百的精确(想一想为什么)
- 链表

- 我们在实现的时候需要特别小心流程控制处锁的获得和释放。
A general design tip, which is useful in concurrent code as well as elsewhere, is to be wary of control flow changes that lead to function returns, exits, or other similar error conditions that halt the execution of a function. Because many functions will begin by acquiring a lock, allocating some memory, or doing other similar stateful operations, when errors arise, the code has to undo all of the state before returning, which is error-prone. Thus, it is best to structure code to minimize this pattern.
所以我们可以对以上的代码进行改进,得到一份更加鲁棒的concurrent链表的实现
并行队列
- 不像单链表只能从表头开始操作,队列可以从队头和队尾同时进行操作,所以我们可以对对头和队尾分别加锁

- 注意这份代码中使用了一个哨兵节点,这也是我们在实现普通队列的时候应该做的(还记得判断队列为空还是为满的条件吗?)
并行哈希表

- 使用前面建立的并行链表,我们就能实现一个非常高效的并行哈希表
- 当然使用的是开放式冲突解决策略中的独立链表法
使用同步原语解决一些问题
- 生产者/消费者问题
- 下面这份代码只能支持单线程运行producer和consumer函数
- 下面这份代码才是真正的多线程解决方案
- 读者/写者问题
- 使用基本的lock和condition variable
- 我们需要一个numReaders变量,记录当前读者的个数,通过这个变量的值来使用条件变量,然后还需要一个lock,用来互斥访问numReaders
- 使用信号量
- 
- 哲学家就餐问题

- 打破死循环
- 
- event-based concurrency
- Common Concurrency Problems
持久化
I/O设备
- A Canonical Device
- The Canonical Protocol

- polling
- Lowering CPU Overhead With Interrupts
- difference between interrupts and polling
- if a device is fast, it may be best to poll; if it is slow, interrupts
- If the speed of the device is not known, or sometimes fast and sometimes slow, it may be best to use a hybrid that polls for a little while and then, if the device is not yet finished, uses interrupts.
- DMA
- using programmed I/O (PIO) to transfer a large chunk of data to a device, the CPU is once again overburdened with a rather trivial task
- without DMA(c表示从内存copy内容到设备)

- with DMA
- Methods Of Device Interaction
- explicit I/O instructions(in and out)
- memorymapped I/O
- 操作系统是怎么保证不同的设备插上之后就能即插即用的呢?
- incorporate各种设备的驱动程序

Interlude: File and Directories
- virtualization of storage
file
- Creating Files
- Reading and Writing Files
- Reading And Writing, But Not Sequentially
- Writing Immediately with fsync()


- you also need to fsync() the directory that contains the file foo. Adding this step ensures not only that the file itself
is on disk, but that the file, if newly created, also is durably a part of the directory
- Renaming Files
- atomicity of renaming


- 在上面的例子中,如果我们能实现rename操作的原子性,我们就能实现文件更新操作的原子性
- Getting Information About Files
- stat() or fstat()

- Removing Files
directories
- Making Directories
- the format of the directory is considered file system metadata, you can only update a directory indirectly by, for example, creating files, directories, or other object types within it

- Reading Directories
- opendir(), readdir(), and closedir()


- Deleting Directories
- rmdir()
- rmdir() has the requirement that the directory be empty (i.e., only has “.” and “..” entries) before it is deleted.
Hard Links and Soft Links
- Hark Link,两个文件指向同一个inode
- 每个inode有一个reference count项,表示指向这个inode的硬链接的数目
- can’t create one to a directory
- can’t hard link to files in other disk partitions
- Soft Link,软链接文件是一种新类型的文件,里面存储的是指向的文件的完整路径
- dangling reference problem
- 在*nix系统下创建hard link或者soft link,使用ln命令
怎么组织同一个目录下的文件或者子目录?
现代操作系统一般使用B树。
对于目录和纯文件来说,二者存储的信息有什么区别?
纯文件,存储的是文件的内容,对于目录,存储的是目录下的子目录和文件以及它们对应的inode number。
Very Simple File System(VSFS)
Locality and The Fast File System
VSFS的效率太低了,我们需要更加快速的操作系统,Fast File System (FFS)。
Crash Consistency: FSCK and Journaling

Solution #1: The File System Checker
- Early file systems took a simple approach to crash consistency. Basically, they decided to let inconsistencies happen and then fix them later (when rebooting). A classic example of this lazy approach is found in a tool that does this: fsck.
- Note that such an approach can’t fix all problems; consider, for example, the case above where the file system looks consistent but the inode points to garbage data. The only real goal is to make sure the file system metadata is internally consistent.
- problem
- too slow(O(size-of-the-disk-volume))
Solution #2: Journaling (or Write-Ahead Logging)
- ext2 without Journaling
- ext3 with Journaling
Data Journaling
在下面的示例中,我们需要更新某个文件的内容,相应的会写一部分用户数据(Db),更新bitmap(B[v2]),更新对应的inode(I[V2])
Attempt 1
- log structure
- protocol
- problem
Attempt 2
- log structure
- protocol

- disk guarantees that any 512-byte write will either happen or not. 如果TxE写完了,就说明这条log被正确记录下来,所以我们需要保证写TxE的原子性,结果就是让TxE的大小刚好是512-byte。
- problem:
- journaling space is finite
- Batching Log Updates
- some file systems do not commit each update to disk one at a time (e.g., Linux ext3); rather, one can buffer all updates
into a global transaction
Attempt 3
- circular log
- protocol
- problem
- still too slow, write data block twice
Attempt 4
- log structure
- protocol

- the only real requirement is that Steps 1 and 2 complete before the issuing of the journal commit block
rule:
write the pointed to object before the object with the pointer to it
Solution #3: Other Approaches
- copy-on-write (yes, COW)
- This technique never overwrites files or directories in place; rather, it places new updates to previously unused locations on disk. After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures
- backpointer-based consistency (or BBC)
- an additional back pointer is added to every block in the system; for example, each data block has a reference to the inode to which it belongs. When accessing a file, the file system can determine if the file is consistent by checking if the forward pointer (e.g., the address in the inode or direct block) points to a block that refers back to it.